Bangladesh Medical Association (BMA) member data extraction

Version : 1.0
Date : 2015-05-21

This notebook illustrates the approach taken to extract doctors' registration data from the BMA. Every doctor in Bangladesh receives a registration number from the BMA after successfully completing an internship. The number establishes the holder's credibility as a doctor, and anyone can use it to verify that a person is a legitimate doctor.

The BMA search portal only supports searching by registration number. Since doctors in this country do not routinely publish their BMA number, we need an interface that can also search the database by a doctor's name.

Tools used:

Python 2
IPython : Python module that provides a Python shell for interactive computing in a browser and in a terminal
Mechanize : Python module for interacting with web pages and submitting forms (Python 2 only)
Pandas : Python module for handling large datasets
Requests : simple HTTP library for Python

Unfortunately, the data on the BMA website is very bare-bones. Only the doctor's name, father's name, address and an official photo are provided against each id number. But we can create a master table from it and populate the remaining fields from other sources.

This interface provides information on about 66000 medical doctors and 4000 dentists. The country currently has around 70000 doctors, so we can expect the data to be complete up to a couple of years ago.

This is a first attempt to collect and accumulate the data. Several crude hacks were employed to get a working model up and running as soon as possible. Initially the information is dumped into CSV files; once we have all the data, it will be imported into a PostgreSQL database.

A first use of the database might be a mobile app where a patient can search for a doctor by name or registration number and see the doctor's photo to verify that he or she is a legitimate doctor.

Extraction



In [6]:

    
#Load the necessary modules
from mechanize import Browser
import pandas as pd
from IPython.core.display import HTML
import requests

We need a function to parse the HTML data after extracting the result.



In [ ]:

    
def extract_sub_string(string, start, finish):
    """
    Extract the substring that begins at the 'start' substring and ends just before
    the first occurrence of the 'finish' substring after that point.
    The returned text includes 'start' itself. An empty string is returned
    if either marker is not found.
    
    :param string: main string, to be parsed
    :type string: str
    
    :param start: starting string
    :type start: str
    
    :param finish: ending string
    :type finish: str
    """
    start_index = string.find(start)
    if start_index == -1:
        return ''
    end_index = string.find(finish, start_index)
    if end_index == -1:
        return ''
    return string[start_index:end_index]
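
To make the helper's behaviour concrete, here is a minimal sketch that runs it on a small page fragment. The HTML below is invented for illustration and is not copied from the BMA site. The point to notice is that the result still begins with the `start` marker, which is why the parsing step later splits the result and keeps only the last piece.

```python
def extract_sub_string(string, start, finish):
    """Return the text from 'start' up to (not including) the next 'finish'."""
    start_index = string.find(start)
    if start_index == -1:
        return ''
    end_index = string.find(finish, start_index)
    if end_index == -1:
        return ''
    return string[start_index:end_index]


# Hypothetical page fragment, made up for this example
sample_html = ('<div class="doctor_info"><tr><td>Name</td>'
               '<td>A. Rahman</td></tr></div><p>footer</p>')

# Step 1: cut out the block that holds the record
fragment = extract_sub_string(sample_html, 'doctor_info', '</div')

# Step 2: cut out one row; the finish marker includes '</tr>' so that the
# '</td>' inside the label cell is not mistaken for the end of the row
row = extract_sub_string(fragment, '<td>Name</td>', '</td></tr>')

# Step 3: the value is whatever follows the last '>'
name = row.split('>')[-1]
print(name)
```

Choosing a `finish` marker that cannot occur inside the `start` marker is what keeps step 2 from stopping too early.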

Now we fetch the result page for each id (1 to 66000) and store the strings in a pandas DataFrame. We will tokenize the resulting strings later.



In [ ]:

    
start = 'doctor_info'
finish = "</div"
extracted_strings = []

for reg_no in xrange(1,66001):
    browser = Browser()
    browser.open("http://bmdc.org.bd/doctors-info/")
    # The page has 2 forms and we are going to select the second one
    browser.select_form(nr=1)
    # This form has 2 input fields: the first, search_doc_id, takes a number and the second, type, indicates whether
    # the id belongs to a medical doctor or a dentist
    browser.form['search_doc_id'] = str(reg_no)
    browser.form['type'] = ['1']
    # Submit the form and read the result
    response = browser.submit()
    content = response.read()
    str_content = str(content)
    # Extract only the relevant portion
    extracted_str = extract_sub_string(str_content, start, finish)
    extracted_strings.append(extracted_str)
    # Originally these commented-out snippets were run so that each group of 100 doctors was recorded in a
    # separate csv file, for testing and stability purposes. Each 100 doctors took around 6-7 minutes to record.
    #if reg_no%100==0:
    #    file_number = reg_no/100
    #    extracted_df = pd.DataFrame(columns=['extracted'])
    #    extracted_df.extracted = extracted_strings
    #    extracted_df.to_csv(str(file_number)+'.csv')
    #    extracted_strings = []
# Build the DataFrame once all pages have been fetched and save it
extracted_df = pd.DataFrame({'extracted': extracted_strings})
extracted_df.to_csv('all_bma_doctor.csv')
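
The idea of saving every 100 records to its own numbered file can also be sketched on its own, without touching the network. The helper name `save_in_batches` is hypothetical, and the sketch uses the standard `csv` module instead of pandas so that it has no dependencies. It is written for Python 3 (under Python 2 the file would be opened in `'wb'` mode instead), and the records are stand-in strings rather than real pages.

```python
import csv
import os
import tempfile


def save_in_batches(records, batch_size, out_dir):
    """Write records to numbered CSV files (1.csv, 2.csv, ...).

    Hypothetical helper, not part of the original notebook.
    Returns the list of file paths that were written.
    """
    written = []
    for batch_start in range(0, len(records), batch_size):
        batch = records[batch_start:batch_start + batch_size]
        file_number = batch_start // batch_size + 1
        path = os.path.join(out_dir, str(file_number) + '.csv')
        with open(path, 'w', newline='') as handle:
            writer = csv.writer(handle)
            writer.writerow(['extracted'])   # same column name as the notebook
            for record in batch:
                writer.writerow([record])
        written.append(path)
    return written


# Stand-in data; in the notebook these would be the extracted HTML strings
fake_records = ['record %d' % n for n in range(1, 251)]
out_dir = tempfile.mkdtemp()
paths = save_in_batches(fake_records, 100, out_dir)
print([os.path.basename(p) for p in paths])
```

Writing small files as the run progresses means a crash or a dropped connection costs at most one batch of work rather than the whole night's run.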

Parsing

On inspection, each nugget of information is enclosed in a specific piece of the HTML string. Using those patterns we can extract the relevant information.



In [ ]:

    
tokenized_df = pd.DataFrame(columns=['Registration','Name','Father','Address', 'Division'])

# Since we originally created a number of csv files, each containing 100 doctors, those were parsed differently.
#file_list = []
#for item in xrange(1,66):
#    file_list.append(str(item)+'.csv')
#for file_ in file_list:
    

df = pd.read_csv('all_bma_doctor.csv')
    
for index in df.index:
        string = df.loc[index, 'extracted']

        start="Registration Number</td>\r\n"                      
        finish='</td>\r\n                                  </tr>\r\n\r\n                                  <tr class="odd">\r\n'
        reg_no = extract_sub_string(string , start, finish)
        reg_no = reg_no.strip()
        reg_no = reg_no.split(" ")[-1]
        #reg_no

        start = '<td>Doctor\'s Name</td>\r\n' 
        finish = '</td>\r\n                                  </tr>\r\n'
        dr_name = extract_sub_string(string , start, finish)
        dr_name=dr_name.strip()
        dr_name = dr_name.split(">")[-1]
        #dr_name

        start = "<td>Father's Name</td>"
        finish = "</td>\r\n                                  </tr>"
        father = extract_sub_string(string , start, finish)
        father = father.strip()
        father = father.split(">")[-1]
        #father

        start = '<td> <address> '
        finish = "</address>"
        address = extract_sub_string(string , start, finish)
        address = address.strip()
        address = address.split("<address>")[-1]
        address = address.replace("<br/>",' ').strip()
        #address

        division = 'Medical'

        values = pd.Series()
        values['Registration'] = reg_no
        values['Name'] = dr_name
        values['Father'] = father
        values['Address'] = address
        values['Division'] = division

        tokenized_df.loc[len(tokenized_df)] = values
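
The same marker-based tokenizing can be tried on a single record in isolation. The sketch below is an illustration under stated assumptions: the sample record is invented, its whitespace is simplified compared with the real pages, and `parse_record` is a hypothetical helper name. Only the labels ("Registration Number", "Doctor's Name", "Father's Name", `<address>`) are taken from the markers used above.

```python
def extract_sub_string(string, start, finish):
    """Return the text from 'start' up to (not including) the next 'finish'."""
    start_index = string.find(start)
    if start_index == -1:
        return ''
    end_index = string.find(finish, start_index)
    if end_index == -1:
        return ''
    return string[start_index:end_index]


def parse_record(record):
    """Split one extracted record into its fields (hypothetical helper)."""
    row_end = '</td>\r\n</tr>'   # simplified end-of-row marker, an assumption
    reg_no = extract_sub_string(record, 'Registration Number</td>', row_end)
    name = extract_sub_string(record, "<td>Doctor's Name</td>", row_end)
    father = extract_sub_string(record, "<td>Father's Name</td>", row_end)
    address = extract_sub_string(record, '<address>', '</address>')
    return {
        'Registration': reg_no.strip().split(' ')[-1],
        'Name': name.strip().split('>')[-1],
        'Father': father.strip().split('>')[-1],
        'Address': address.replace('<address>', '').replace('<br/>', ' ').strip(),
        'Division': 'Medical',
    }


# Invented record; the real pages use different spacing and values
sample_record = (
    '<tr class="even">\r\n<td>Registration Number</td>\r\n<td> 5100</td>\r\n</tr>\r\n'
    '<tr class="odd">\r\n<td>Doctor\'s Name</td>\r\n<td>Example Name</td>\r\n</tr>\r\n'
    '<tr class="even">\r\n<td>Father\'s Name</td>\r\n<td>Example Father</td>\r\n</tr>\r\n'
    '<tr class="odd">\r\n<td> <address> 1 Example Road<br/>Dhaka </address></td>\r\n</tr>\r\n'
)

parsed = parse_record(sample_record)
print(parsed)
```

Because the markers depend on exact whitespace, testing them against one saved record like this is a quick way to find out why a field comes back empty.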



In [17]:

    
tokenized_df[5000:5010]









    Out[17]:

      Registration  Name                     Father  Address                               Division
5000  5100          Md. Shah Mizanur Rahman  NaN     Dist.- Pirojpur                       Medical
5001  5101          Momtaz Khanam            NaN     73 Sabaybash Dhaka                    Medical
5002  5102          Santana Chakravarty      NaN     Supanighat Dist.- Sylhet              Medical
5003  5103          Md. Masudur Rahman       NaN     Vill- Sarai Bidyapara Dist.- Rangpur  Medical
5004  5104          Md Abdus Salam           NaN     Vill- Bhalaipur Dist.- Jessore        Medical
5005  5105          Md. Abdul Wadud          NaN     47 Dhanmondi R/a Dhaka                Medical
5006  5106          Md. Abdul Wadud          NaN     47, Dhanmondi R/ A, Road No-3 Dhaka   Medical
5007  5107          A. H. M Mushihur Rahman  NaN     Vill- Shalikhan Dist.- Bogra          Medical
5008  5108          Md. Bazlur Rahman Khan   NaN     Vill- Kursatoli Dist.- Tangail        Medical
5009  5109          Feroza Begum             NaN     Eddalat Para Dist.- Patuakhali        Medical

Photo extraction

Now we have the information about the doctors. We can also extract the image files containing their photos.



In [15]:

    
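for bma_id in xrange(1,66001):
    # Download the photo for this registration number
    response = requests.get('http://bmdc.org.bd/dphotos/medical/'+str(bma_id)+'.JPG')
    # Skip ids for which the server does not return a photo
    if response.status_code != 200:
        continue
    with open(str(bma_id)+'.jpg','wb') as f:
        f.write(response.content)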

Storing into Database

Until this point the demo work was done in Django's built-in SQLite database. Now that we have an external data source, we will populate a stand-alone database so that it can be shared between various apps.
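
As a rough sketch of what the load-and-search step could look like, the example below puts a few of the parsed rows into a table and searches it by name. It uses Python's built-in `sqlite3` module with an in-memory database purely as a stand-in for the stand-alone database. The table name `doctors`, the column names and the `search_by_name` helper are assumptions made for illustration, as is the composite key, which assumes that a registration number may repeat across the medical and dental divisions.

```python
import sqlite3

# A few rows taken from the parsed output; Father is missing there, so None
rows = [
    ('5100', 'Md. Shah Mizanur Rahman', None, 'Dist.- Pirojpur', 'Medical'),
    ('5101', 'Momtaz Khanam', None, '73 Sabaybash Dhaka', 'Medical'),
    ('5102', 'Santana Chakravarty', None, 'Supanighat Dist.- Sylhet', 'Medical'),
]

# In-memory SQLite database used as a stand-in for the real database
connection = sqlite3.connect(':memory:')
connection.execute(
    'CREATE TABLE doctors ('
    'registration TEXT, name TEXT, father TEXT, address TEXT, division TEXT, '
    'PRIMARY KEY (registration, division))'
)
connection.executemany('INSERT INTO doctors VALUES (?, ?, ?, ?, ?)', rows)
connection.commit()


def search_by_name(conn, fragment):
    """Return (registration, name) pairs whose name contains the fragment."""
    cursor = conn.execute(
        'SELECT registration, name FROM doctors '
        'WHERE name LIKE ? ORDER BY registration',
        ('%' + fragment + '%',),
    )
    return cursor.fetchall()


matches = search_by_name(connection, 'rahman')
print(matches)
```

The query passes the search text as a parameter instead of building the SQL string by hand, which matters once the search box is exposed to app users.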

To-Do

Complete the extraction. So far, information on around 16000 doctors has been extracted in 2 nights. Hopefully the process will be completed over the weekend.
Dump all the data into a database.


